{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CHAPTER 10 Data Aggregation and Group Operations（数据汇总和组操作）\n",
    "\n",
    "\n",
    "这一章的内容：\n",
    "\n",
    "- 把一个pandas对象（series或DataFrame）按key分解为多个\n",
    "- 计算组的汇总统计值（group summary statistics），比如计数，平均值，标准差，或用户自己定义的函数\n",
    "- 应用组内的转换或其他一些操作，比如标准化，线性回归，排序，子集选择\n",
    "- 计算透视表和交叉列表\n",
    "- 进行分位数分析和其他一些统计组分析\n",
    "\n",
    "# 10.1 GroupBy Mechanics（分组机制）\n",
    "\n",
    "Hadley Wickham，是很多R语言有名库的作者，他描述group operation(组操作)为split-apply-combine(分割-应用-结合)。第一个阶段，存储于series或DataFrame中的数据，根据不同的keys会被split（分割）为多个组。而且分割的操作是在一个特定的axis（轴）上。例如，DataFrame能按行（axis=0）或列（axis=1）来分组。之后，我们可以把函数apply（应用）在每一个组上，产生一个新的值。最后，所以函数产生的结果被combine(结合)为一个结果对象（result object）。下面是一个图示：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/ikthz.png)\n",
    "\n",
    "每一个用于分组的key能有很多形式，而且keys也不必都是一种类型：\n",
    "\n",
    "- 含有值的list或array的长度，与按axis分组后的长度是一样的\n",
    "- 值的名字指明的是DataFrame中的列名\n",
    "- 一个dict或Series，给出一个对应关系，用于对应按轴分组后的值与组的名字\n",
    "- 能在axis index（轴索引）上被调用的函数，或index上的labels（标签）\n",
    "\n",
    "注意后面三种方法都是用于产生一个数组的快捷方式，而这个数组责备用来分割对象（split up the object）。不用担心这些很抽象，这一章会有很多例子来帮助我们理解这些方法。先从一个例子来开始吧，这里有一个用DataFrame表示的表格型数据集："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>data1</th>\n",
       "      <th>data2</th>\n",
       "      <th>key1</th>\n",
       "      <th>key2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.364533</td>\n",
       "      <td>0.633262</td>\n",
       "      <td>a</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.353368</td>\n",
       "      <td>0.361008</td>\n",
       "      <td>a</td>\n",
       "      <td>two</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.253311</td>\n",
       "      <td>-1.107940</td>\n",
       "      <td>b</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.513444</td>\n",
       "      <td>-1.038035</td>\n",
       "      <td>b</td>\n",
       "      <td>two</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.920317</td>\n",
       "      <td>2.037712</td>\n",
       "      <td>a</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      data1     data2 key1 key2\n",
       "0  1.364533  0.633262    a  one\n",
       "1  1.353368  0.361008    a  two\n",
       "2  0.253311 -1.107940    b  one\n",
       "3 -1.513444 -1.038035    b  two\n",
       "4 -0.920317  2.037712    a  one"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],\n",
    "                   'key2' : ['one', 'two', 'one', 'two', 'one'], \n",
    "                   'data1' : np.random.randn(5), \n",
    "                   'data2' : np.random.randn(5)})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设我们想要，通过使用key1作为labels，来计算data1列的平均值。有很多方法可以做到这点，一种是访问data1，并且使用列（a series）在key1上，调用groupby。(译者：其实就是按key1来进行分组，但只保留data1这一列)："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grouped = df['data1'].groupby(df['key1'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pandas.core.groupby.SeriesGroupBy object at 0x111db3710>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个grouped变量是一个GroupBy object(分组对象)。实际上现在还没有进行任何计算，除了调用group key(分组键)`df['key1']`时产生的一些中间数据。整个方法是这样的，这个GroupBy object(分组对象)已经有了我们想要的信息，现在需要的是对于每一个group（组）进行一些操作。例如，通过调用GroupBy的mean方法，我们可以计算每个组的平均值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "key1\n",
       "a    0.599194\n",
       "b   -0.630067\n",
       "Name: data1, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "之后我们会对于调用.mean()后究竟发生了什么进行更详细的解释。重要的是，我们通过group key（分组键）对数据（a series）进行了聚合，这产生了一个新的Series，而且这个series的索引是key1列中不同的值。\n",
    "\n",
    "得到的结果中，index（索引）也有'key1'，因为我们使用了df['key1']。\n",
    "\n",
    "如果我们传入多个数组作为一个list，那么我们会得到不同的东西："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "key1  key2\n",
       "a     one     0.222108\n",
       "      two     1.353368\n",
       "b     one     0.253311\n",
       "      two    -1.513444\n",
       "Name: data1, dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "means = df['data1'].groupby([df['key1'], df['key2']]).mean()\n",
    "means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里我们用了两个key来分组，得到的结果series现在有一个多层级索引，这个多层索引是根据key1和key2不同的值来构建的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>key2</th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>key1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>0.222108</td>\n",
       "      <td>1.353368</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>0.253311</td>\n",
       "      <td>-1.513444</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "key2       one       two\n",
       "key1                    \n",
       "a     0.222108  1.353368\n",
       "b     0.253311 -1.513444"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "means.unstack()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上面的例子里，group key全都是series，即DataFrame中的一列，当然，group key只要长度正确，可以是任意的数组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "years = np.array([2005, 2005, 2006, 2005, 2006])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California  2005    1.353368\n",
       "            2006    0.253311\n",
       "Ohio        2005   -0.074456\n",
       "            2006   -0.920317\n",
       "Name: data1, dtype: float64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['data1'].groupby([states, years]).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pandas.core.groupby.SeriesGroupBy object at 0x112530e48>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['data1'].groupby([states, years])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.364533\n",
       "1    1.353368\n",
       "2    0.253311\n",
       "3   -1.513444\n",
       "4   -0.920317\n",
       "Name: data1, dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['data1']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>data1</th>\n",
       "      <th>data2</th>\n",
       "      <th>key1</th>\n",
       "      <th>key2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.364533</td>\n",
       "      <td>0.633262</td>\n",
       "      <td>a</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.353368</td>\n",
       "      <td>0.361008</td>\n",
       "      <td>a</td>\n",
       "      <td>two</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.253311</td>\n",
       "      <td>-1.107940</td>\n",
       "      <td>b</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.513444</td>\n",
       "      <td>-1.038035</td>\n",
       "      <td>b</td>\n",
       "      <td>two</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.920317</td>\n",
       "      <td>2.037712</td>\n",
       "      <td>a</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      data1     data2 key1 key2\n",
       "0  1.364533  0.633262    a  one\n",
       "1  1.353368  0.361008    a  two\n",
       "2  0.253311 -1.107940    b  one\n",
       "3 -1.513444 -1.038035    b  two\n",
       "4 -0.920317  2.037712    a  one"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其中分组信息经常就在我们处理的DataFrame中，在这种情况下，我们可以传入列名（可以是字符串，数字，或其他python对象）作为group keys："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>data1</th>\n",
       "      <th>data2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>key1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>0.599194</td>\n",
       "      <td>1.010661</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>-0.630067</td>\n",
       "      <td>-1.072987</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         data1     data2\n",
       "key1                    \n",
       "a     0.599194  1.010661\n",
       "b    -0.630067 -1.072987"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('key1').mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>data1</th>\n",
       "      <th>data2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>key1</th>\n",
       "      <th>key2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">a</th>\n",
       "      <th>one</th>\n",
       "      <td>0.222108</td>\n",
       "      <td>1.335487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>1.353368</td>\n",
       "      <td>0.361008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">b</th>\n",
       "      <th>one</th>\n",
       "      <td>0.253311</td>\n",
       "      <td>-1.107940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>-1.513444</td>\n",
       "      <td>-1.038035</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              data1     data2\n",
       "key1 key2                    \n",
       "a    one   0.222108  1.335487\n",
       "     two   1.353368  0.361008\n",
       "b    one   0.253311 -1.107940\n",
       "     two  -1.513444 -1.038035"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(['key1', 'key2']).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们注意到第一个例子里，`df.groupby('key1').mean()`的结果里并没有key2这一列。因为`df['key2']`这一列不是数值型数据，我们称这种列为nuisance column（有碍列），这种列不会出现在结果中。默认，所有的数值型列都会被汇总计算，但是出现有碍列的情况的话，就会过滤掉这种列。\n",
    "\n",
    "一个很有用的GroupBy方法是size，会返回一个包含group size(组大小)的series："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "key1  key2\n",
       "a     one     2\n",
       "      two     1\n",
       "b     one     1\n",
       "      two     1\n",
       "dtype: int64"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(['key1', 'key2']).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另外一点需要注意的是，如果作为group key的列中有缺失值的话，也不会出现在结果中。\n",
    "\n",
    "# 1 Iterating Over Groups（对组进行迭代）\n",
    "\n",
    "GroupBy对象支持迭代，能产生一个2-tuple（二元元组），包含组名和对应的数据块。考虑下面的情况："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a\n",
      "      data1     data2 key1 key2\n",
      "0  1.364533  0.633262    a  one\n",
      "1  1.353368  0.361008    a  two\n",
      "4 -0.920317  2.037712    a  one\n",
      "b\n",
      "      data1     data2 key1 key2\n",
      "2  0.253311 -1.107940    b  one\n",
      "3 -1.513444 -1.038035    b  two\n"
     ]
    }
   ],
   "source": [
    "for name, group in df.groupby('key1'):\n",
    "    print(name)\n",
    "    print(group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于有多个key的情况，元组中的第一个元素会被作为另一个元组的key值（译者：可以理解为多个key的所有组合情况）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('a', 'one')\n",
      "      data1     data2 key1 key2\n",
      "0  1.364533  0.633262    a  one\n",
      "4 -0.920317  2.037712    a  one\n",
      "('a', 'two')\n",
      "      data1     data2 key1 key2\n",
      "1  1.353368  0.361008    a  two\n",
      "('b', 'one')\n",
      "      data1    data2 key1 key2\n",
      "2  0.253311 -1.10794    b  one\n",
      "('b', 'two')\n",
      "      data1     data2 key1 key2\n",
      "3 -1.513444 -1.038035    b  two\n"
     ]
    }
   ],
   "source": [
    "for (k1, k2), group in df.groupby(['key1', 'key2']):\n",
    "    print((k1, k2))\n",
    "    print(group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，也可以对数据的一部分进行各种操作。一个便利的用法是，用一个含有数据片段（data pieces）的dict来作为单行指令(one-liner)："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pieces = dict(list(df.groupby('key1')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'a':       data1     data2 key1 key2\n",
       " 0  1.364533  0.633262    a  one\n",
       " 1  1.353368  0.361008    a  two\n",
       " 4 -0.920317  2.037712    a  one, 'b':       data1     data2 key1 key2\n",
       " 2  0.253311 -1.107940    b  one\n",
       " 3 -1.513444 -1.038035    b  two}"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pieces"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>data1</th>\n",
       "      <th>data2</th>\n",
       "      <th>key1</th>\n",
       "      <th>key2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.253311</td>\n",
       "      <td>-1.107940</td>\n",
       "      <td>b</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.513444</td>\n",
       "      <td>-1.038035</td>\n",
       "      <td>b</td>\n",
       "      <td>two</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      data1     data2 key1 key2\n",
       "2  0.253311 -1.107940    b  one\n",
       "3 -1.513444 -1.038035    b  two"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pieces['b']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "groupby默认作用于axis=0，但是我们可以指定任意的轴。例如，我们可以按dtyple来对列进行分组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "data1    float64\n",
       "data2    float64\n",
       "key1      object\n",
       "key2      object\n",
       "dtype: object"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grouped = df.groupby(df.dtypes, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "float64\n",
      "      data1     data2\n",
      "0  1.364533  0.633262\n",
      "1  1.353368  0.361008\n",
      "2  0.253311 -1.107940\n",
      "3 -1.513444 -1.038035\n",
      "4 -0.920317  2.037712\n",
      "object\n",
      "  key1 key2\n",
      "0    a  one\n",
      "1    a  two\n",
      "2    b  one\n",
      "3    b  two\n",
      "4    a  one\n"
     ]
    }
   ],
   "source": [
    "for dtype, group in grouped:\n",
    "    print(dtype)\n",
    "    print(group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2 Selecting a Column or Subset of Columns (选中一列，或列的子集)\n",
    "\n",
    "如果一个GroupBy对象是由DataFrame创建来的，那么通过列名或一个包含列名的数组来对GroupBy对象进行索引的话，就相当于对列取子集做聚合（column subsetting for aggregation）。这句话的意思是：\n",
    "\n",
    "```\n",
    "df.groupby('key1')['data1'] \n",
    "df.groupby('key1')[['data2']]\n",
    "```\n",
    "上面的代码其实就是下面的语法糖（Syntactic sugar）：\n",
    "```\n",
    "df['data1'].groupby(df['key1']) \n",
    "df[['data2']].groupby(df['key1'])\n",
    "```\n",
    "> 语法糖(Syntactic sugar),是由Peter J. Landin(和图灵一样的天才人物，是他最先发现了Lambda演算，由此而创立了函数式编程)创造的一个词语，它意指那些没有给计算机语言添加新功能，而只是对人类来说更“甜蜜”的语法。语法糖往往给程序员提供了更实用的编码方式，有益于更好的编码风格，更易读。不过其并没有给语言添加什么新东西。\n",
    "\n",
    "尤其是对于一些很大的数据集，这种用法可以聚集一部分列。例如，在处理一个数据集的时候，想要只计算data2列的平均值，并将结果返还为一个DataFrame，我们可以这样写：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>data1</th>\n",
       "      <th>data2</th>\n",
       "      <th>key1</th>\n",
       "      <th>key2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.364533</td>\n",
       "      <td>0.633262</td>\n",
       "      <td>a</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.353368</td>\n",
       "      <td>0.361008</td>\n",
       "      <td>a</td>\n",
       "      <td>two</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.253311</td>\n",
       "      <td>-1.107940</td>\n",
       "      <td>b</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.513444</td>\n",
       "      <td>-1.038035</td>\n",
       "      <td>b</td>\n",
       "      <td>two</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.920317</td>\n",
       "      <td>2.037712</td>\n",
       "      <td>a</td>\n",
       "      <td>one</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      data1     data2 key1 key2\n",
       "0  1.364533  0.633262    a  one\n",
       "1  1.353368  0.361008    a  two\n",
       "2  0.253311 -1.107940    b  one\n",
       "3 -1.513444 -1.038035    b  two\n",
       "4 -0.920317  2.037712    a  one"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>data2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>key1</th>\n",
       "      <th>key2</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">a</th>\n",
       "      <th>one</th>\n",
       "      <td>1.335487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>0.361008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">b</th>\n",
       "      <th>one</th>\n",
       "      <td>-1.107940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>-1.038035</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              data2\n",
       "key1 key2          \n",
       "a    one   1.335487\n",
       "     two   0.361008\n",
       "b    one  -1.107940\n",
       "     two  -1.038035"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(['key1', 'key2'])[['data2']].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果一个list或一个数组被传入，返回的对象是一个分组后的DataFrame，如果传入的只是单独一个列名，那么返回的是一个分组后的grouped："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pandas.core.groupby.SeriesGroupBy object at 0x1125309e8>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s_grouped = df.groupby(['key1', 'key2'])['data2']\n",
    "s_grouped"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "key1  key2\n",
       "a     one     1.335487\n",
       "      two     0.361008\n",
       "b     one    -1.107940\n",
       "      two    -1.038035\n",
       "Name: data2, dtype: float64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s_grouped.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3 Grouping with Dicts and Series（用Dicts与Series进行分组）\n",
    "\n",
    "分组信息可以不是数组的形式。考虑下面的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "people = pd.DataFrame(np.random.randn(5, 5),\n",
    "                      columns=['a', 'b', 'c', 'd', 'e'],\n",
    "                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>e</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Joe</th>\n",
       "      <td>1.358054</td>\n",
       "      <td>-0.124378</td>\n",
       "      <td>0.159913</td>\n",
       "      <td>-0.006129</td>\n",
       "      <td>-1.116065</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Steve</th>\n",
       "      <td>0.926572</td>\n",
       "      <td>-0.281652</td>\n",
       "      <td>-0.586583</td>\n",
       "      <td>-0.266538</td>\n",
       "      <td>-0.216959</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wes</th>\n",
       "      <td>0.277803</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.820144</td>\n",
       "      <td>-0.002076</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jim</th>\n",
       "      <td>1.623214</td>\n",
       "      <td>0.109414</td>\n",
       "      <td>2.967603</td>\n",
       "      <td>0.075661</td>\n",
       "      <td>1.085864</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Travis</th>\n",
       "      <td>-0.578750</td>\n",
       "      <td>1.252605</td>\n",
       "      <td>0.757412</td>\n",
       "      <td>0.352343</td>\n",
       "      <td>-1.342396</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               a         b         c         d         e\n",
       "Joe     1.358054 -0.124378  0.159913 -0.006129 -1.116065\n",
       "Steve   0.926572 -0.281652 -0.586583 -0.266538 -0.216959\n",
       "Wes     0.277803       NaN       NaN  0.820144 -0.002076\n",
       "Jim     1.623214  0.109414  2.967603  0.075661  1.085864\n",
       "Travis -0.578750  1.252605  0.757412  0.352343 -1.342396"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "people"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设我们有一个组，对应多个列，而且我们想要按组把这些列的和计算出来："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "mapping = {'a': 'red', 'b': 'red', 'c': 'blue',\n",
    "           'd': 'blue', 'e': 'red', 'f': 'orange'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，我们可以通过这个dict构建一个数组，然后传递给groupby，但其实我们可以直接传入dict（可以注意到key里有一个'f'，这说明即使有，没有被用到的group key，也是ok的）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "by_column = people.groupby(mapping, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>blue</th>\n",
       "      <th>red</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Joe</th>\n",
       "      <td>0.153784</td>\n",
       "      <td>0.117611</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Steve</th>\n",
       "      <td>-0.853121</td>\n",
       "      <td>0.427961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wes</th>\n",
       "      <td>0.820144</td>\n",
       "      <td>0.275727</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jim</th>\n",
       "      <td>3.043264</td>\n",
       "      <td>2.818492</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Travis</th>\n",
       "      <td>1.109754</td>\n",
       "      <td>-0.668541</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            blue       red\n",
       "Joe     0.153784  0.117611\n",
       "Steve  -0.853121  0.427961\n",
       "Wes     0.820144  0.275727\n",
       "Jim     3.043264  2.818492\n",
       "Travis  1.109754 -0.668541"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "by_column.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种用法同样适用于series，这种情况可以看作是固定大小的映射（fixed-size mapping）:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a       red\n",
       "b       red\n",
       "c      blue\n",
       "d      blue\n",
       "e       red\n",
       "f    orange\n",
       "dtype: object"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "map_series = pd.Series(mapping)\n",
    "map_series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>blue</th>\n",
       "      <th>red</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Joe</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Steve</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wes</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jim</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Travis</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        blue  red\n",
       "Joe        2    3\n",
       "Steve      2    3\n",
       "Wes        1    2\n",
       "Jim        2    3\n",
       "Travis     2    3"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "people.groupby(map_series, axis=1).count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 Grouping with Functions（用函数进行分组）\n",
    "\n",
    "比起用dict火series定义映射关系，使用python的函数是更通用的方法。任何一个作为group key的函数，在每一个index value（索引值）上都会被调用一次，函数计算的结果在返回的结果中会被用做group name。更具体一点，考虑前一个部分的DataFrame，用人的名字作为索引值。假设我们想要按照名字的长度来分组；同时我们要计算字符串的长度，使用len函数会变得非常简单："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>e</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.259071</td>\n",
       "      <td>-0.014964</td>\n",
       "      <td>3.127516</td>\n",
       "      <td>0.889676</td>\n",
       "      <td>-0.032277</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.926572</td>\n",
       "      <td>-0.281652</td>\n",
       "      <td>-0.586583</td>\n",
       "      <td>-0.266538</td>\n",
       "      <td>-0.216959</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.578750</td>\n",
       "      <td>1.252605</td>\n",
       "      <td>0.757412</td>\n",
       "      <td>0.352343</td>\n",
       "      <td>-1.342396</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          a         b         c         d         e\n",
       "3  3.259071 -0.014964  3.127516  0.889676 -0.032277\n",
       "5  0.926572 -0.281652 -0.586583 -0.266538 -0.216959\n",
       "6 -0.578750  1.252605  0.757412  0.352343 -1.342396"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "people.groupby(len).sum() # len函数在每一个index（即名字）上被调用了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "混合不同的函数、数组，字典或series都不成问题，因为所有对象都会被转换为数组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "key_list = ['one', 'one', 'one', 'two', 'two']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>e</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">3</th>\n",
       "      <th>one</th>\n",
       "      <td>0.277803</td>\n",
       "      <td>-0.124378</td>\n",
       "      <td>0.159913</td>\n",
       "      <td>-0.006129</td>\n",
       "      <td>-1.116065</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>1.623214</td>\n",
       "      <td>0.109414</td>\n",
       "      <td>2.967603</td>\n",
       "      <td>0.075661</td>\n",
       "      <td>1.085864</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <th>one</th>\n",
       "      <td>0.926572</td>\n",
       "      <td>-0.281652</td>\n",
       "      <td>-0.586583</td>\n",
       "      <td>-0.266538</td>\n",
       "      <td>-0.216959</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <th>two</th>\n",
       "      <td>-0.578750</td>\n",
       "      <td>1.252605</td>\n",
       "      <td>0.757412</td>\n",
       "      <td>0.352343</td>\n",
       "      <td>-1.342396</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              a         b         c         d         e\n",
       "3 one  0.277803 -0.124378  0.159913 -0.006129 -1.116065\n",
       "  two  1.623214  0.109414  2.967603  0.075661  1.085864\n",
       "5 one  0.926572 -0.281652 -0.586583 -0.266538 -0.216959\n",
       "6 two -0.578750  1.252605  0.757412  0.352343 -1.342396"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "people.groupby([len, key_list]).min()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5 Grouping by Index Levels （按索引层级来分组）\n",
    "\n",
    "最后关于多层级索引数据集(hierarchically indexed dataset)，一个很方便的用时是在聚集（aggregate）的时候，使用轴索引的层级（One of the levels of an axis index）。看下面的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultiIndex(levels=[['JP', 'US'], [1, 3, 5]],\n",
       "           labels=[[1, 1, 1, 0, 0], [0, 1, 2, 0, 1]],\n",
       "           names=['cty', 'tenor'])"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], \n",
    "                                     [1, 3, 5, 1, 3]], \n",
    "                                    names=['cty', 'tenor'])\n",
    "columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th>cty</th>\n",
       "      <th colspan=\"3\" halign=\"left\">US</th>\n",
       "      <th colspan=\"2\" halign=\"left\">JP</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>tenor</th>\n",
       "      <th>1</th>\n",
       "      <th>3</th>\n",
       "      <th>5</th>\n",
       "      <th>1</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.898073</td>\n",
       "      <td>0.156686</td>\n",
       "      <td>-0.151011</td>\n",
       "      <td>0.423881</td>\n",
       "      <td>0.336215</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.736301</td>\n",
       "      <td>0.901515</td>\n",
       "      <td>0.081655</td>\n",
       "      <td>0.450248</td>\n",
       "      <td>-0.031245</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.619125</td>\n",
       "      <td>-1.041775</td>\n",
       "      <td>0.129422</td>\n",
       "      <td>1.222881</td>\n",
       "      <td>-0.717410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.998536</td>\n",
       "      <td>-1.373455</td>\n",
       "      <td>1.724266</td>\n",
       "      <td>-2.084529</td>\n",
       "      <td>0.535651</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "cty          US                            JP          \n",
       "tenor         1         3         5         1         3\n",
       "0     -0.898073  0.156686 -0.151011  0.423881  0.336215\n",
       "1      0.736301  0.901515  0.081655  0.450248 -0.031245\n",
       "2     -1.619125 -1.041775  0.129422  1.222881 -0.717410\n",
       "3      0.998536 -1.373455  1.724266 -2.084529  0.535651"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)\n",
    "hier_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "要想按层级分组，传入层级的数字或者名字，通过使用level关键字："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>cty</th>\n",
       "      <th>JP</th>\n",
       "      <th>US</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "cty  JP  US\n",
       "0     2   3\n",
       "1     2   3\n",
       "2     2   3\n",
       "3     2   3"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hier_df.groupby(level='cty', axis=1).count()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
